Title

Home Credit Default Risk

Group number: 02

Team members:

  1. Kiran Kanrandikar (kikarand@iu.edu)
  2. Yashwitha Reddy (ypondug@iu.edu)
  3. Rahul (rgomathi@iu.edu)
  4. Sathish (satsoun@iu.edu)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges:

  1. Dataset size
    • ~688 MB compressed, with millions of rows of data
    • ~2.71 GB uncompressed

Kaggle API setup

Kaggle is a data science competition platform that hosts many datasets. In the past it was troublesome to submit your results, as you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; finishing a submission takes less than 15 minutes.

  1. Install the library and configure the API token

For more detailed information on setting the Kaggle API see here and here.
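The setup can be sketched as follows (a minimal sequence for a Unix-like shell; the kaggle.json token comes from your Kaggle account page, and the paths are the CLI's defaults):

```shell
# Install the official Kaggle CLI
pip install kaggle

# Place the API token where the CLI expects it
# (kaggle.com -> Account -> Create New API Token downloads kaggle.json)
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

# Sanity check: list the competition's files
kaggle competitions files home-credit-default-risk
```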

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who either cannot obtain loans or become victims of untrustworthy lenders.

Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).


Data files overview

There are 7 different sources of data:

    • application_{train|test}.csv – the main tables, one row per loan application
    • bureau.csv – clients' previous credits from other financial institutions
    • bureau_balance.csv – monthly balances of the previous credits in bureau
    • previous_application.csv – clients' previous applications at Home Credit
    • POS_CASH_balance.csv – monthly snapshots of previous POS and cash loans
    • credit_card_balance.csv – monthly snapshots of previous credit cards
    • installments_payments.csv – repayment history for previous Home Credit loans

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary, and unzip them using either of the following approaches:

  1. Click the Download button on the competition Data webpage and unzip the zip file to DATA_DIR
  2. If you plan to use the Kaggle API, use the following steps.
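The API route can be sketched as follows (assuming a configured Kaggle token and the DATA_DIR path defined above):

```shell
# Download the competition data into the data directory
kaggle competitions download -c home-credit-default-risk -p ../../../Data/home-credit-default-risk

# Unzip in place (-o overwrites if files already exist)
unzip -o ../../../Data/home-credit-default-risk/home-credit-default-risk.zip \
      -d ../../../Data/home-credit-default-risk
```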

Mount Gdrive

Imports

Extract Zip Files, ignore if unzipped already

Data files overview

Data Dictionary

A data dictionary comes as part of the data download. It is named HomeCredit_columns_description.csv

Load Application train


Application test

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets

One Click Setup | Imports | Load datasets

The cells below are redundant; they are included so that all datasets can be reloaded quickly after events such as a kernel failure.

Exploratory Data Analysis

Summary of Application train

Missing data for application train

Distribution of the target column

Correlation with the target column

Applicants Age

Applicants' years of employment

Interesting observation: DAYS_EMPLOYED: some rows have the value 365243 (equivalent to roughly 1000 years), i.e., some people appear to have been employed for 1000 years. This is clearly a placeholder (sentinel) value rather than real data.
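One way to handle this sentinel is to flag it and replace it with NaN so that downstream imputers can deal with it. A minimal sketch (the toy frame is illustrative; only the column name and sentinel value come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the DAYS_EMPLOYED column (365243 is the sentinel).
df = pd.DataFrame({"DAYS_EMPLOYED": [-1000, -2500, 365243, -300]})

# Keep an indicator flag, then replace the sentinel with NaN.
df["DAYS_EMPLOYED_ANOM"] = df["DAYS_EMPLOYED"] == 365243
df["DAYS_EMPLOYED"] = df["DAYS_EMPLOYED"].replace(365243, np.nan)
```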

Applicants occupations

Applicants Income Type

Target vs Gender

Family Status of Loan applicants

Observation: The relationship between TARGET and "EXT_SOURCE_3", "EXT_SOURCE_2", "EXT_SOURCE_1", "DAYS_EMPLOYED" is not linear and monotonic.

Dataset : bureau

Observation: bureau contains records for all SK_ID_CURR values

Bureau: Applicants Days Credit

Applicants Credit history status

Observation: CREDIT_CURRENCY has 4 types, but the data predominantly contains only one type

Applicants with more than 5, 10, or 15 bureau records

Insights on aggregated data

Dataset: bureau_balance

Observation: we can use the STATUS column to better understand an applicant's repayment behaviour per credit

Questions: What do the different status codes indicate? Do they have any significance?

Dataset: previous_application

Features that can be used from previous_application when grouped by "SK_ID_CURR"

Design question: how do we handle the situation where no previous_application record exists for an SK_ID_CURR?
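One common way around this (a sketch with toy stand-in tables, not the project's actual pipeline) is to aggregate previous_application per client and LEFT-join it onto the application table, so clients with no history survive the join; count-like features can then default to 0 while mean-like features stay NaN for the imputer:

```python
import pandas as pd

# Toy stand-ins for application_train and previous_application.
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
prev = pd.DataFrame({"SK_ID_CURR": [1, 1, 3],
                     "AMT_APPLICATION": [100.0, 200.0, 50.0]})

# Aggregate per client, then LEFT-join so clients without history are kept.
prev_agg = prev.groupby("SK_ID_CURR").agg(
    PREV_COUNT=("AMT_APPLICATION", "size"),
    PREV_AMT_MEAN=("AMT_APPLICATION", "mean"),
).reset_index()
joined = app.merge(prev_agg, on="SK_ID_CURR", how="left")

# Count-like features default to 0; mean-like features stay NaN.
joined["PREV_COUNT"] = joined["PREV_COUNT"].fillna(0)
```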

Dataset: credit_card_balance

Selecting a random SK_ID_PREV to gain insights

Insights on aggregated data

Dataset: POS_CASH_BALANCE

Observing a random SK_ID_PREV to gain insights

Insights on aggregated data

Dataset: installments_payments

Observing a random SK_ID_PREV to gain insights

Aggregating data


Dataset questions

Unique record for each SK_ID_CURR

Previous applications for the submission file

The persons in the Kaggle submission file have had previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID

Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 5 previous applications (22%)
* Medium = 10 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)

Design Decisions | Sample Examples | Sample Feature Engineering

Joining secondary tables with the primary table

In the case of the HCDR competition (and many other machine learning problems involving multiple tables, whether in 3NF or not), we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table yields many new features about each loan application; these tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?

Joining previous_application with application_x

We refer to the application_train data (and the application_test data) as the primary table and the other files as secondary tables (e.g., the previous_application dataset). The secondary tables join to the primary table on SK_ID_CURR; the tables describing previous Home Credit loans (POS_CASH_balance, installments_payments, credit_card_balance) additionally carry SK_ID_PREV to link back to previous_application.

Let's assume we wish to generate a feature based on previous application attempts. Possible features here could be, for example, the number of previous applications per client, or the mean credit amount requested across them.

To build such features, we need to join the application_train data (and the application_test data) with the previous_application dataset (and the other available datasets).

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application datasets, generating many new (derived) features, and then join (merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) before processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]

I want you to think about this section and build on this.

Roadmap for secondary table processing

  1. Transform all the secondary tables into features that can be joined into the main (application) table (labeled and unlabeled)
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'
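The transformation step can be sketched as follows (a toy bureau table; the real tables have many more columns, and the BUREAU_-prefixed names are an illustrative convention, not the project's exact naming):

```python
import pandas as pd

# Toy bureau table; the real data has one row per prior credit per SK_ID_CURR.
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "DAYS_CREDIT": [-100, -400, -50],
    "AMT_CREDIT_SUM": [1000.0, 500.0, 2000.0],
})

# Aggregate every numeric column with several statistics per client.
agg = bureau.groupby("SK_ID_CURR").agg(["mean", "max", "count"])

# Flatten the MultiIndex columns to BUREAU_<col>_<stat> style names,
# ready to be merged into the application table on SK_ID_CURR.
agg.columns = [f"BUREAU_{c}_{s}".upper() for c, s in agg.columns]
agg = agg.reset_index()
```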

Pandas dataframe aggregation detour

Aggregate using one or more operations over the specified axis.

For more details see agg

DataFrame.agg(func, axis=0, *args, **kwargs)

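A quick illustration of agg on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# A single function applied to every column returns a Series...
col_max = df.agg("max")

# ...while a list of functions returns a DataFrame with one row per function.
summary = df.agg(["min", "max"])
```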

Multiple condition expressions in Pandas

So far, both our boolean selections have involved a single condition. You can, of course, have as many conditions as you would like; to do so you combine the boolean expressions with logical operators.

Although Python uses the keywords and, or, and not, these will not work when testing multiple conditions with pandas. The details of why are explained here.

You must instead use the following operators with pandas:

    • & for and
    • | for or
    • ~ for not
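A short sketch (with toy column names) showing these operators in action; note that each condition must be parenthesized, because &, |, and ~ bind more tightly than the comparison operators:

```python
import pandas as pd

df = pd.DataFrame({"AMT_INCOME": [100, 250, 400],
                   "CNT_CHILDREN": [0, 2, 1]})

# & combines conditions with AND, | with OR, ~ negates.
both = df[(df["AMT_INCOME"] > 200) & (df["CNT_CHILDREN"] > 0)]
either = df[(df["AMT_INCOME"] > 300) | (df["CNT_CHILDREN"] == 0)]
negated = df[~(df["CNT_CHILDREN"] > 0)]
```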

Sample Feature Engineering | previous_applications

Missing value analysis

Sample Feature Engineering | previous_application analysis

Sample Feature Engineering | Using feature transformer...

Join the labeled dataset

Test Data Feature Engineering

Join the unlabeled dataset (i.e., the submission file)

Convert categorical features to numerical approximations (via pipeline)

Sample Processing pipeline | Known Issues

OHE with previously unseen unique values in the test/validation set

Train, validation and Test sets (and the leakage problem we have mentioned previously):

Let's look at a small use case that shows how to deal with this:

This last problem can be solved by using the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.

Here is an example of that in action:

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE',
               'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE']

# Notice handle_unknown="ignore" in the OHE, which ignores values in the
# validation/test set that do NOT occur in the training set.
# DataFrameSelector is the custom column-selecting transformer defined earlier.
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
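A minimal runnable demonstration of the handle_unknown="ignore" behaviour (toy categories, not the HCDR columns): a category unseen during fit encodes as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["Cash"], ["Revolving"]])
test = np.array([["Cash"], ["Unseen_type"]])

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# "Unseen_type" was not in the training data, so its row is all zeros.
encoded = ohe.transform(test).toarray()
```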

OHE case study: The breast cancer wisconsin dataset (classification)

Please see this blog for more details on OHE when the validation/test sets have previously unseen unique values.

HCDR preprocessing

ABSTRACT

Feature Engineering | Feature Selections | Preprocessing

Train, Valid, Test dataset selection

Feature Engineering

Testing | Experimental Features

Todo's

    • Bureau Features
    • Bureau Balance
    • application
    • Credit card balance
    • Class-based feature transformer
    • POS_CASH_BALANCE
    • Installments payments

Data Pipeline

Secondary Tables Aggregation

Auxiliary classes

Modeling

Baseline Model

To get a baseline, we use some of the features after preprocessing them through the pipeline. The baseline model is a logistic regression model.
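A minimal sketch of such a baseline (toy numeric data standing in for the preprocessed features; the actual project pipeline has many more steps):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric matrix with one missing value; stands in for the real features.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 1.0]])
y = np.array([0, 0, 1, 1])

baseline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X, y)

# Predicted probability of default (the positive class).
proba = baseline.predict_proba(X)[:, 1]
```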

Gridsearch with CV
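A toy sketch of grid search with cross-validation, scored by ROC AUC as in this competition (the data and the parameter grid here are illustrative, not the project's actual grid):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data for illustration.
rng = np.random.RandomState(0)
X = rng.randn(40, 3)
y = (X[:, 0] + 0.1 * rng.randn(40) > 0).astype(int)

# Exhaustively try each C with 3-fold CV, keeping the best AUC.
param_grid = {"C": [0.01, 0.1, 1.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="roc_auc", cv=3)
search.fit(X, y)
```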

Screenshots of Experimental Analysis

Random forest


Decision Tree Classifier


XGBoost


Resampling


Random Forest after resampling


Decision tree after resampling


XGBoost after resampling


Evaluation metrics

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

The scikit-learn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, also denoted AUC or AUROC. By computing the area under the ROC curve, the curve's information is summarized in one number.

>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75

AUC score

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
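Building such a file from predictions is straightforward; a sketch using the illustrative IDs and probabilities from the format example above:

```python
import pandas as pd

# Hypothetical IDs and predicted probabilities, mirroring the format example.
submission = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "TARGET": [0.1, 0.9, 0.2],
})

# Kaggle expects a header row and no index column.
submission.to_csv("submission.csv", index=False)
```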

Kaggle submission via the command line API
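A typical submission from the command line (assuming the API token is configured; the -m message is free text):

```shell
# Submit the prediction file to the competition
kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "phase 2 submission"

# List your submissions and their scores once processed
kaggle competitions submissions -c home-credit-default-risk
```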

Report submission

Click on this link

Model Evaluation

Write-up

For this phase of the project, you will need to submit a write-up summarizing the work you did. The write-up form is available on Canvas (Modules-> Module 12.1 - Course Project - Home Credit Default Risk (HCDR)-> FP Phase 2 (HCDR) : write-up form ). It has the following sections:

Abstract

In Phase 1 of our HCDR project we created a baseline model that was not accurate enough for predictions, so in Phase 2 we improved the model using various techniques. In Phase 2 we focused on feature engineering and hyperparameter tuning; in addition, we concentrated on feature selection, analysis of feature importance, and ensemble methods. First, we performed data aggregation by creating pipelines for all the secondary tables; the aggregated data is merged into the main table using a pipeline that includes imputation, scaling, and normalization. A class-based feature transformer is used for feature transformation, and FeatureUnion combines the num_pipeline and cat_pipeline. A series of experiments was conducted to find the most important features. Finally, we performed hyperparameter tuning on our models through grid search. The models we used in this phase are decision tree, random forest, and XGBoost; the best model is determined by a grid search over each model's parameters.

Project Description

In Phase 1 of our project we performed exploratory data analysis, one-hot encoded all the categorical features for feature engineering, and built a baseline pipeline using logistic regression (accuracy on the held-out test set: 91.59%; AUC: 0.7356; training time: 35.7 s). As our workflow below shows, we concentrated on feature engineering and hyperparameter tuning in Phase 2. Aside from that, we focused on feature selection, feature-importance analysis, and ensemble approaches. To begin, we created pipelines for all of the secondary tables to aggregate data; a pipeline then merges the aggregated data into the main table. This procedure includes imputation, scaling, and normalization. For feature transformation, a class-based feature transformer is employed, covering the following tables for feature engineering.

[Workflow diagram]

For hyperparameter tuning we have done grid search CV. We then used decision tree, random forest, and XGBoost models, performing a grid search over each model's parameters to determine the best model.

Introduction

Feature Engineering and transformers

We used class-based feature engineering: the bureau features, bureau balance, application, and credit card balance tables are processed through a class-based feature transformer.

[Screenshot]

Parameter tuning is done using grid search. We obtained the following results for our hyperparameter tuning:

[Screenshots of hyperparameter tuning results]

Modelling Pipelines

Results and Discussion

The following results were obtained from the respective experiments:

[Screenshots of experimental results]

Conclusion

We performed feature engineering and hyperparameter tuning in this phase. We also tried to improve our results using decision tree, random forest, and XGBoost models. The best model we have is XGBoost, with a test accuracy of 0.8771; the test accuracy of XGBoost after resampling is 0.8132. In Phase 3 we are planning on developing the following:

Kaggle Submission

References

Some of the material in this notebook has been adapted from here.

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools

Read the following: